Automatic sentence segmentation for classical Chinese: The Spring and Autumn Annals as an example


Abstract

There are no sentence boundaries in most classical Chinese texts. Because texts of this kind are difficult to read, experts or linguists segment them manually. This article explores the effectiveness of automatic sentence segmentation methods so as to provide a reference for automatic punctuation. On the basis of machine learning methods, we chose three components of machine learning, namely models, tagging schemes, and features, and compare the results. The models include conditional random field (CRF), long short-term memory (LSTM), BiLSTM–CRF, and Bidirectional Encoder Representations from Transformers (BERT) models. There are five tagging schemes, and the features include a statistical feature, Guangyun, and Fanqie. Finally, the performance of the combined feature template is evaluated by ten-fold cross-validation on four texts of different genres. The SikuBERT model proved to be effective at present. Different tagging schemes and various features are introduced. The results show that the 5-tag-J scheme can improve performance. The statistical feature is an important clue for segmentation and is useful for related tasks, but Guangyun and Fanqie have little impact. Other factors include genres and writing styles.
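Treating segmentation as character tagging, as the abstract describes, can be illustrated with a minimal sketch. The two-tag scheme below ("E" for a character that ends a segment, "O" for any other) is a hypothetical simplification chosen for illustration; it is not one of the article's five schemes, and the function names and the set of boundary marks are likewise assumptions.

```python
# Hypothetical two-tag scheme (an assumption, not taken from the article):
#   "E" = character that ends a segment, "O" = any other character.
BOUNDARY_MARKS = set("。！？；，")  # illustrative set of punctuation marks


def text_to_tags(punctuated: str) -> tuple[list[str], list[str]]:
    """Turn punctuated text into parallel lists of characters and tags."""
    chars: list[str] = []
    tags: list[str] = []
    for ch in punctuated:
        if ch in BOUNDARY_MARKS:
            # A punctuation mark relabels the preceding character as an end.
            if tags:
                tags[-1] = "E"
        else:
            chars.append(ch)
            tags.append("O")
    return chars, tags


def tags_to_segments(chars: list[str], tags: list[str]) -> list[str]:
    """Rebuild segments from characters and (gold or predicted) tags."""
    segments: list[str] = []
    current: list[str] = []
    for ch, tag in zip(chars, tags):
        current.append(ch)
        if tag == "E":
            segments.append("".join(current))
            current = []
    if current:  # trailing characters without a closing "E"
        segments.append("".join(current))
    return segments


if __name__ == "__main__":
    chars, tags = text_to_tags("元年春，王正月。")
    print(list(zip(chars, tags)))
    print(tags_to_segments(chars, tags))
```

Under this framing, a model such as a CRF or a BERT-style tagger would be trained to predict the tag sequence from the unpunctuated characters, and `tags_to_segments` would turn its predictions back into segments.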


Similar resources

Classical Chinese Sentence Segmentation

Sentence segmentation is a fundamental issue in Classical Chinese language processing. To facilitate reading and processing of the raw Classical Chinese data, we propose a statistical method to split unstructured Classical Chinese text into smaller pieces such as sentences and clauses. The segmenter based on the conditional random field (CRF) model is tested under different tagging schemes and ...


Chinese sentence segmentation as comma classification

We describe a method for disambiguating Chinese commas that is central to Chinese sentence segmentation. Chinese sentence segmentation is viewed as the detection of loosely coordinated clauses separated by commas. Trained and tested on data derived from the Chinese Treebank, our model achieves a classification accuracy of close to 90% overall, which translates to an F1 score of 70% for detectin...


Autumn spring.

artists; signatures and monograms, an international directory. Castagno, John. Scarecrow Pr., ©2007 567 p. $225.00 Castagno, a multimedia artist and sculptor and author of other books in the Artists’ Signatures and Monograms series, presents a reference for identifying, authenticating, and verifying signatures and works of well-known and little-known abstract artists. About 2,300 painters, print...


A New Model for Automatic Sentence Segmentation

The Context Overlapping Model (COM) is presented in this article for the task of Automatic Sentence Segmentation (ASS). Compared with HMM, COM expands the observation from a single word unit to an n-gram unit, with an overlapping part between neighboring units. Owing to the co-occurrence constraint and the transition constraint, COM reduces the search space and improves the accuracy of segmentation....


Journal

Journal title: Digital Scholarship in the Humanities

Year: 2023

ISSN: 2055-7671, 2055-768X

DOI: https://doi.org/10.1093/llc/fqad016